Accurately measuring the dynamics of real-world graphs is a difficult task, because the observed properties can
be biased for various reasons, in particular the fact that the measurement period is finite. In this paper, we
introduce a general methodology that allows us to determine whether the observation window is long enough
to characterize a given property in any dynamic system.

We apply this methodology to the study of session lengths and file lifetimes in two P2P datasets. We show
that the properties behave differently: for session lengths, our methodology allows us to accurately
characterize the shape of their distribution. In contrast, for file lifetimes, we show that this property
cannot be characterized, either because it is not stationary or because our measurement period is too short.

1 Introduction 

Many systems are naturally dynamic. For instance, in the Internet, routers, ASes and/or links between them are
created or deleted [MOVL09]; in peer-to-peer (P2P) networks users join or leave the system [SR06, SGG03,
LBFM09] and exchange different files at different times. In all these cases, understanding the dynamics of 
the system is a key issue. However, accurately measuring these dynamics is a difficult task. In particular, 
the fact that the observation window is necessarily finite induces a bias for property characterization [SR06, 
SGG03]. Though this bias tends to decrease when the observation window length increases, it is difficult to 
quantify it in practice, or know whether it is negligible or not. 

In this paper, we introduce a new methodology that makes it possible to rigorously determine the minimum
observation time required to characterize a stationary property in real-world dynamic systems. This methodology
is different from and complementary to other methodologies existing in the literature [SR06, SGG03, GT99],
and has two main advantages. First, it makes it possible to determine whether the observation window was long
enough for a rigorous characterization. Second, it can be applied to any property characterizing the dynamics
of a system. To illustrate its relevance, we apply it to the study of session lengths and files' life duration
in two different P2P systems.



2 Methodology 

Suppose we start observing a dynamic graph at a time $t$, for a duration $l$. We denote by $W_{t,l}$ this
observation window. We are faced with two problems if we want to characterize the graph's dynamics from the
observation of $W_{t,l}$. First, $l$ must be long enough for $W_{t,l}$ to be representative. Second, even if it
is representative, the fact that $l$ is finite still induces a bias for property characterization. Indeed,
events occurring before $t$ or after $t + l$ are not observed, which prevents some quantities from being
characterized accurately. An important point to observe is that the longer the measurement period, the smaller
the bias induced.

Our methodology addresses these two issues at the same time. Intuitively, it aims at deciding if the
measurement period $W_{t,l}$ is long enough to characterize a given property $P$, i.e. if the bias induced by
its finiteness on the observed property is negligible. If the window $W_{t,l}$ is long enough, then if we use a
longer window $W_{t,l+x}$, the observed property does not change: $P(W_{t,l}) = P(W_{t,l+x})$. In order to know
if a given window is long enough, we use windows of increasing length $W_{0,l_1}, W_{0,l_2}, \ldots, W_{0,l_n}$,
with $l_1 < l_2 < \cdots < l_n$. By studying how the observed property
$P(W_{0,l_1}), P(W_{0,l_2}), \ldots, P(W_{0,l_n})$ evolves as a function of $l$, we determine
if it is correctly evaluated or not.
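
The procedure above can be sketched in a few lines of Python. This is only an illustration under stated
assumptions, not the implementation used for the measurements: we assume that observations are (begin, end)
intervals, that an interval is truncated at the window boundaries, and that the property is any function of
the observed durations. The names clip_to_window and property_over_windows are hypothetical.

```python
def clip_to_window(intervals, start, length):
    """Durations of (begin, end) intervals as seen through the window [start, start + length].

    Assumption: whatever happens before `start` or after `start + length` is not observed,
    so intervals are truncated at the window boundaries.
    """
    end = start + length
    durations = []
    for begin, finish in intervals:
        if finish <= start or begin >= end:
            continue  # entirely outside the window: never observed
        durations.append(min(finish, end) - max(begin, start))
    return durations


def property_over_windows(intervals, lengths, prop):
    """Evaluate `prop` on windows W_{0,l} of increasing length l."""
    return {l: prop(clip_to_window(intervals, 0, l)) for l in sorted(lengths)}
```

If the values returned for the largest window lengths no longer change, the window can be considered long
enough for this property; if they keep changing, either the window is too short or the property is not
stationary.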

Finally, an important point is that characterizing a property $P$ only makes sense if it is stationary, i.e. if
$P$ does not evolve while the measurement is in progress. Notice however that if it is not stationary, our
methodology will not be able to provide a characterization: the observed property $P$ will not become stable
when the observation window length $l$ increases. If it does become stable, this means both that $W_{t,l}$ is
long enough, and that $P$ is stationary. Notice that, depending on the property studied, other types of bias
can occur, see for instance [SR06], including biases coming from the identification of users and their sessions.
We will also rigorously take this into account, see Section 4.1.

Here, most of the properties we study are complementary cumulative distributions, i.e. for each value $k$,
$P_k$ is the fraction of all observed values which are larger than or equal to $k$.
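
As an illustration of this definition, a complementary cumulative distribution can be computed as follows.
This is a minimal sketch; the function name ccdf and the choice of evaluating the distribution on an explicit
list of values ks are assumptions made for the example.

```python
from bisect import bisect_left


def ccdf(values, ks):
    """For each k in ks, return the fraction of observed values that are >= k."""
    if not values:
        raise ValueError("at least one observation is required")
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left gives the number of observations strictly smaller than k
    return [(n - bisect_left(ordered, k)) / n for k in ks]
```

For instance, with the observations 1, 2, 2 and 5, the fraction of values larger than or equal to 2 is 3/4.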

To study how an observed distribution $P$ evolves with the length of the observation window, we will first
plot the observed distributions $P(W_{t,l})$ for different values of $l$. In order to confirm more formally the
visual observations, we will also study a statistical indicator which quantifies how close two distributions
$P$ and $Q$ are to each other: the Monge-Kantorovich distance, or M-K distance [GKT09], compares two normalized
cumulative (complementary or not) distributions $P$ and $Q$. It is equal to the mean of the distance between
the two distributions over the $N$ values at which they are compared:
$MK(P,Q) = \left(\sum_{i=1}^{N} |P_i - Q_i|\right)/N$.
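
A direct transcription of this indicator is given below as a sketch. It assumes that both distributions have
already been evaluated on the same list of values, so that they are two sequences of equal length; the function
name mk_distance is hypothetical.

```python
def mk_distance(p, q):
    """Mean absolute difference between two distributions sampled at the same points."""
    if len(p) != len(q) or not p:
        raise ValueError("p and q must be non-empty and of the same length")
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)
```

Two identical distributions are at distance 0, and the distance grows as the curves move apart.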

We use this indicator to study how the observed distribution $P(W_{t,l})$ evolves: we compute the M-K
distance between $P(W_{0,l})$ (with different values of $l$) and $P(W_{0,l_{max}})$, where $l_{max}$ is the
length of the longest observation window for this dataset, and plot this as a function of $l$. Following
[WAL04], we also study the mean and the standard deviation of $P(W_{0,l})$ as a function of $l$.
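
The following sketch puts these indicators together. It is an illustration only: it assumes that the values
observed in each window $W_{0,l}$ are available as a dictionary indexed by $l$, that all distributions are
evaluated on a common list of values called grid, and that the population standard deviation is used. The
names convergence_curve, samples_by_length and grid are hypothetical.

```python
from bisect import bisect_left
from statistics import mean, pstdev


def ccdf(values, grid):
    """Fraction of values >= k, for each k in grid."""
    ordered = sorted(values)
    n = len(ordered)
    return [(n - bisect_left(ordered, k)) / n for k in grid]


def convergence_curve(samples_by_length, grid):
    """Return (l, M-K distance to the longest window, mean, standard deviation) for each l."""
    l_max = max(samples_by_length)
    reference = ccdf(samples_by_length[l_max], grid)
    curve = []
    for l in sorted(samples_by_length):
        values = samples_by_length[l]
        dist = ccdf(values, grid)
        mk = sum(abs(a - b) for a, b in zip(dist, reference)) / len(grid)
        curve.append((l, mk, mean(values), pstdev(values)))
    return curve
```

Plotting the second, third and fourth components against the first gives the three curves discussed in the
text. By construction, the M-K distance is 0 for the longest window.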

3 Data 

In order to show the relevance of our methodology, we use two datasets. The queries dataset is a
capture of the UDP traffic of a large eDonkey server [ALM09]. It consists of the queries made by users (for
lists of files matching certain keywords, or for providers for a given file), and of the server's answers to
these queries. The measurement lasted for 10 weeks, which represents 1 billion messages, with 89 million
peers and 275 million files involved. The logins dataset consists of a trace of the logins and logouts of peers
on the eDonkey network [LBFM09]. It contains more than 200 million connections by more than 14
million peers, over a period of 27 days. The two datasets are therefore complementary.

4 Users' session lengths 

4.1 Definition of a session

We do not formally know when user sessions begin or end in the queries dataset, because there is no notion 
of session in the UDP eDonkey protocol. Instead, users make stand-alone queries and receive answers from 
the server. We therefore have to infer sessions from these queries. 

It is natural to consider that two consecutive queries made by the same user belong to the same session
(whether they are for the same file or not) if the time elapsed between them is short, and belong to two
different sessions if it is long. The question is then to find an appropriate threshold for distinguishing 
between these two cases. Based on the study of the inter-query time distribution (not presented here), we 
have chosen to use a threshold of 10800 seconds, i.e. 3 hours. 
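
This inference rule can be sketched as follows for the queries of a single user. This is a minimal
illustration and not the code used for the study: it assumes that query times are given in seconds, and that
a gap exactly equal to the threshold starts a new session, which the text does not specify. The names
SESSION_GAP and split_sessions are hypothetical.

```python
SESSION_GAP = 10800  # threshold in seconds (3 hours), as chosen in the text


def split_sessions(timestamps, gap=SESSION_GAP):
    """Group the query times of one user into sessions, returned as (start, end) pairs.

    Assumption: two consecutive queries belong to the same session when the time
    elapsed between them is strictly smaller than `gap`.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][1] < gap:
            sessions[-1][1] = t  # extend the current session
        else:
            sessions.append([t, t])  # start a new session
    return [(start, end) for start, end in sessions]
```

The length of a session is then the difference between its end and its start; with this sketch, a session
made of a single query has length 0.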

4.2 Characterization of session lengths

We now apply our methodology to the study of the session length distribution $S$, by studying $S(W_{0,l})$ for
different values of $l$.

Figure 1: Complementary cumulative distributions of $S(W_{0,l})$ for different observation window lengths in
log-lin scale, for the queries dataset.

Figure 2: Complementary cumulative distributions of $S(W_{0,l})$ for observation window lengths $l = 1$ week
and $l = 10$ weeks in lin-log scale, for the queries dataset.

Figure 1 shows the complementary cumulative distribution $S(W_{0,l})$ for different values of $l$, up to
$l = 10$ weeks, for the queries dataset. The shapes of these distributions are similar, with a small fraction
of sessions with length smaller than 2000 s, and an approximately linear shape between 2000 s and 100000 s.
However, when $l < 1$ day, the distributions exhibit a clear cut-off. This is not the case anymore for $l > 4$
days: the tail of the distribution flattens after a bend occurring close to 100000 s (approximately 28 hours),
and we observe a small fraction of extreme values after this bend. For observation windows larger than four
days, the shape of the distribution does not seem to evolve anymore: the distributions corresponding to
$l = 1$ week and $l = 10$ weeks (presented in the inset) are very similar to each other and to the one obtained
for $l = 4$ days.

One must however be careful when drawing conclusions from a visual examination. Indeed, if we observe the
same plot as the inset of Figure 1 but with a linear scale on the x-axis and a logarithmic scale on the y-axis
(see Figure 2), the distributions seem visually strongly different from each other. However, the distributions
are different only for less than 1 % of the values, which are values after the bend in Figure 1 and are extreme
values. The fact that the extreme values change when $l$ increases shows that they cannot be characterized with
our methodology, and we leave their study for further work.

To confirm these observations, we study $MK(S(W_{0,l}), S(W_{0,l_{max}}))$ as a function of $l$, presented in
Figure 3. The values observed tend to decrease (with fluctuations) until the observation window reaches
approximately 150 hours (6 days and 6 hours). After this, the value of the M-K distance becomes very small:
this shows that the corresponding distributions are very close to each other.

We also studied the standard deviation and the mean of $S(W_{0,l})$ as a function of $l$ (not presented here).
We observe that the mean becomes stable once $l$ reaches approximately 1 week, at the same time as the
M-K distance. This confirms that an observation window of one week is long enough to accurately estimate
the distribution. The standard deviation, however, does not seem to converge as the observation window
length increases, confirming that the distribution cannot be fully characterized. This is consistent with the
distinction between the normal part of the distribution and extreme values.

Figure 4 shows the complementary cumulative distribution $S(W_{0,l})$ for different values of $l$, up to
$l = 3$ weeks, for the logins dataset. We can see that the shapes of these distributions are similar, and get
closer to each other as $l$ increases. However, when we compare these distributions with the M-K distance (see
Figure 5), the values obtained tend to decrease linearly, which means that the distributions change at a
constant rate. The values obtained for the mean and the standard deviation also do not stabilize. Therefore,
we cannot fully characterize this distribution. We are however confident that the true shape of the
distribution is not far from the one we observed.

Figure 3: $MK(S(W_{0,l}), S(W_{0,l_{max}}))$ as a function of $l$, for the queries dataset.

Figure 4: Complementary cumulative distributions of $S(W_{0,l})$ for different observation window lengths, for
the logins dataset.

We considered two different definitions for a file's lifetime $F$. The first one is the same as for users'
session lengths: we use a threshold and consider that a file is not present in the system if there are no
consecutive queries for this file separated by less than this threshold. The second definition consists in
considering the time interval between the first and the last query for a given file. In both cases, the shape
of the distributions $F(W_{0,l})$ (not presented here) evolves strongly with $l$. We therefore conclude that
this property cannot be characterized. The question which arises is whether this is because this property is
intrinsically not stationary or because our measurement period is too short to be able to characterize it.
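
The two definitions can be illustrated by the following sketch, which works on the query times of a single
file. It is one possible reading of the definitions and not the code used for the study: the threshold value
for files is not given in the text and is therefore left as a parameter, and we assume that the first
definition yields one duration per period of presence. The names lifetime_first_to_last and presence_periods
are hypothetical.

```python
def lifetime_first_to_last(timestamps):
    """Second definition: time elapsed between the first and the last query for the file."""
    return max(timestamps) - min(timestamps)


def presence_periods(timestamps, threshold):
    """First definition (one reading): durations of the periods during which the file is present.

    Assumption: a period of presence is a maximal run of queries in which consecutive
    queries are separated by less than `threshold`.
    """
    periods = []
    for t in sorted(timestamps):
        if periods and t - periods[-1][1] < threshold:
            periods[-1][1] = t  # the file is still present
        else:
            periods.append([t, t])  # the file (re)appears
    return [end - begin for begin, end in periods]
```

With the first definition a file that disappears and comes back contributes several shorter durations, whereas
the second definition always gives a single duration per file.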



6 Conclusion 

In this paper we introduced an empirical methodology for deciding when the bias induced by the finiteness 
of observation windows in dynamic systems becomes negligible. To illustrate the relevance of this approach, 
we applied it to the study of session lengths and files' life duration in two different datasets.

We have shown that we can characterize some properties, but not all. When a property cannot be characterized,
our methodology does not make it possible to determine whether the observation window should be increased or
not, since we do not know whether the property itself is stationary. It is interesting to note that, for the
same dataset, some properties can be accurately characterized, and others not.